Packages

The needed packages:

library(sf)
library(stars)
library(tidyr)
library(magrittr)
library(purrr)
library(dplyr) # best to load last
library(reshape2)
library(data.table)
library(ggplot2)
library(lubridate)

Read in local data sets

setwd("~/Desktop/OUCRU/FacebookData/ColocationMarc")
dist_polygons <- readRDS(GADM_census_path)
coloc_mat <- readRDS(coloc_mat_path)
vn0 <- readRDS(GADM_country_path)
worldpop <- read_stars("VNM_ppp_v2b_2020_UNadj.tif")
colocation <- readRDS(colocation_path)

Redefining the hist() function:

hist2 <- function(...) hist(..., main = NA)

SEIR Metapopulation Contact Model

The following model uses colocation data to formulate the coupling in a SEIR metapopulation model. The equations that describe the epidynamics in each of the populations is given below: \[ \begin{aligned} \frac{dS_i}{dt} & = - S_i \sum_{j} \beta_{ij} \frac{I_j}{N_j} \\ \frac{dE_i}{dt} & = S_i \sum_{j} \beta_{ij} \frac{I_j}{N_j} - \sigma E_i \\ \frac{dI_i}{dt} & = \sigma E_i - \gamma I_i \\ \frac{dR_i}{dt} & = \gamma I_i \\ \end{aligned} \] where \(\frac{1}{\sigma} = 7\) days is the latency period and \(\frac{1}{\gamma} = 7\) is the recovery period. Note that \(N_i = S_i + E_i + I_i + R_i\). We assume that \(\beta_{ij}\) which represents the transmission rate from population \(j\) to population \(i\) is proportional to \(C_{ij}\), the colocation probability for population \(i\) and population \(j\).

Parameterization Method 1

The following parameterization is an adaptation of the contact model given in the paper, Modeling the impact of human mobility and travel restrictions on the potential spread of SARS-CoV-2 in Taiwan.

The \(\beta_{ij}\) are defined as follows: \[ \begin{aligned} \beta_{ij} = \gamma R_{0 ij} \end{aligned} \] where \[ \begin{aligned} R_{0 ij} & = R_{0 \: Hanoi'} \frac{C_{ij}}{C_{Hanoi'-Hanoi'}} \\ \end{aligned} \] and \(Hanoi'\) is the district in Hà Nội with the highest density. This assumes that the strength of transmission from district \(j\) to district \(i\) is dependent on colocation probabilities and independent of the relative population sizes of the interacting regions.

Let’s calculate the district in Hanoi with the highest density.

HanoiDist <- dist_polygons %>%
  filter(province == "Hà Nội") %>% 
  arrange(desc(den_km2)) %>%
  head(1) %>%
  pull(district)

HanoiDistID <- dist_polygons %>%
  filter(province == "Hà Nội") %>%
  filter(district == HanoiDist) %>% 
  pull(polygon_id)

As such, we will set the \(R_0\) of Hoàn Kiếm to 18; that is, \(R_{0 \: Hanoi'} = 18\).

The following function computes the \(R_{0 ij}\) values with the standard reference district with \(R_0\) of standardR0 and district ID of standardDistID.

compute_R0 <- function(coloc_mat, standardR0, standardDistID) {
  melt(coloc_mat) %>% 
    setNames(c("polygon1_id", "polygon2_id", "link_value")) %>%
    mutate(R0 = standardR0 * link_value / coloc_mat[paste(standardDistID), paste(standardDistID)])
}

Let’s compute the \(R_{0 ij}\) with \(R_{0 \: Hanoi'} = 18\):

R0 <- compute_R0(coloc_mat, 18, HanoiDistID)

Intracity \(R_0\)

Intracity \(R_0\) is defined as \(R_ii\).

Let’s see the distribution of the intracity \(R_0\) values; that is, the distribution of \(R_{0 ii}\) values.

intracityR0 <- R0 %>%
  filter(polygon1_id == polygon2_id) %>%
  rename(intraR0 = R0)
hist2(intracityR0$intraR0, n = 50, xlab = "Intracity R0", ylab = "number of districts", axes = FALSE)
axis(1, seq(0, 60, 10))
axis(2)

cb <- RColorBrewer::brewer.pal(9, "YlOrBr")
color_generator <- colorRampPalette(cb)
pal <- color_generator(10)
hist2(intracityR0$intraR0, quantile(intracityR0$intraR0, seq(0, 1, le = 11)), col = pal, axes = FALSE,
      xlab = "Intracity R0", ylab = "density of probability")
axis(1, seq(0, 60, 10))
axis(2)

quantile(intracityR0$intraR0, seq(0, 1, le = 11))
##        0%       10%       20%       30%       40%       50%       60%       70% 
##  0.000000  1.586063  2.008983  2.373354  2.887077  3.456204  4.453696  6.125704 
##       80%       90%      100% 
##  8.322006 13.101921 55.396777

Surprisingly, we see that there are several districts with intracity \(R_0\) values greater than 20. This is far higher than what you would expect for measles. Let’s see what cities in particular have these high intracity \(R_0\) values.

Let’s map the intracity \(R_0\) values for the districts:

intracityR0sf <- dist_polygons %>%
  right_join(select(intracityR0, -polygon2_id), c("polygon_id" = "polygon1_id"))

The following district have intracity \(R0\) greater than 10.

vn0 %>% 
  st_geometry() %>% 
  plot(col = "grey")

intracityR0sf %>%
  select("intraR0") %>% 
  filter(intraR0 > 10) %>%
  plot(add = TRUE, col = "red")

The following districts have intracity \(R0\) greater than 20.

intracityR0sf %>% 
  filter(intraR0 > 20) %>%
  arrange(desc(intraR0)) %>%
  st_drop_geometry() %>%
  select(province, district, intraR0)
##       province       district  intraR0
## 1      Yên Bái       Trạm Tấu 55.39678
## 2     Hà Giang   Hoàng Su Phì 42.67776
## 3    Quảng Nam      Tây Giang 42.55601
## 4   Quảng Ngãi        Tây Trà 41.49006
## 5      Kon Tum      Kon Plông 33.77775
## 6    Thanh Hóa      Mường Lát 32.71860
## 7       Sơn La        Sốp Cộp 32.20555
## 8    Quảng Nam      Phước Sơn 31.46382
## 9     Hà Giang        Xín Mần 31.07934
## 10    Hà Giang       Đồng Văn 29.85194
## 11  Quảng Ngãi         Lý Sơn 29.61061
## 12    Hà Giang        Mèo Vạc 28.03566
## 13    Cao Bằng        Bảo Lạc 27.90237
## 14   Quảng Nam     Nam Trà My 26.41899
## 15    Lạng Sơn       Đình Lập 26.37892
## 16    Cao Bằng        Bảo Lâm 24.83981
## 17      Sơn La        Bắc Yên 24.52896
## 18  Quảng Ninh         Ba Chẽ 23.66076
## 19  Quảng Ninh      Bình Liêu 23.55896
## 20   Điện Biên       Tủa Chùa 23.15901
## 21    Cao Bằng     Thông Nông 23.08779
## 22   Điện Biên Điện Biên Đông 22.44800
## 23  Bình Thuận        Phú Quí 22.33890
## 24 Hồ Chí Minh              4 22.28844
## 25     Gia Lai        Ayun Pa 22.03421
## 26    Hà Giang        Quản Bạ 20.98567
## 27 Hồ Chí Minh              3 20.84098
## 28     Lào Cai      Si Ma Cai 20.67039
## 29   Quảng Trị      Quảng Trị 20.22414
vn0 %>% 
  st_geometry() %>% 
  plot(col = "grey")

intracityR0sf %>% 
  filter(intraR0 > 20) %>% 
  select(intraR0) %>% 
  plot(add = TRUE, col = "red")

It appears that remote regions of Vietnam have relatively higher intracity \(R_0\) values. Let’s verify this by graphing relationship between population density and intracity \(R_0\) values under the assumption that more remote regions will have lower population densities.

ggplot(intracityR0sf, aes(x = log10(den_km2), y = intraR0)) + geom_point()

Both populations with relatively low population densities and populations with relatively high population densities have high intracity \(R_0\) values. Populations with especially low densities have the greatest intracity \(R_0\) values. Note that we should see a positive relationship between population density and intracity \(R_0\) values. The intracity \(R_0\) values need to be adjusted to fix the artifact of how the colocation matrix is constructed.

Intercity \(R_0\)

Intercity \(R_0\) for location \(i\) is defined as \(\sum_{j \neq i} R_{0 ij}\).

Let’s consider the intercity R0 values:

intercityR0 <- R0 %>%
  filter(polygon1_id != polygon2_id) %>%
  rename(interR0 = R0)

totalInter <- intercityR0 %>%
  group_by(polygon1_id) %>%
  summarize(interR0 = sum(interR0)) %>%
  arrange(desc(interR0))
## `summarise()` ungrouping output (override with `.groups` argument)

Let’s map the intracity \(R_0\) values for the districts: The following districts have intercityR0 values greater than 1:

intercityR0sf <- dist_polygons %>%
  right_join(totalInter, c("polygon_id" = "polygon1_id"))
vn0 %>% 
  st_geometry() %>% 
  plot(col = "grey")

intercityR0sf %>%
  select("interR0") %>% 
  filter(interR0 > 1) %>%
  plot(add = TRUE, col = "red")

Let’s graph the relationship between population density and intercity \(R_0\) values:

ggplot(intercityR0sf, aes(x = log10(den_km2), y = interR0)) + geom_point()

As we would expect, population centers with higher population densities are better connected with surrounding regions resulting in higher intercity \(R_0\) values. Hence, the intercity \(R_0\) values under this model are consistent with epidemiological assumptions.

Intercity and Intracity \(R_0\) values

Let’s graph the relationship between intracity \(R_0\) values and intercity \(R_0\) values:

tmp <- intercityR0sf %>%
  right_join(select(intracityR0, -polygon2_id, -link_value), c("polygon_id" = "polygon1_id"))
ggplot(tmp, aes(x = intraR0, y = interR0)) + geom_point() +
  geom_abline(slope = 1, intercept = 0)

There are two distinct groups in the plot: one group has larger intercity \(R_0\) values relative to intracity \(R_0\) values, while the other has larger intracity \(R_0\) values relative to intercity \(R_0\) values.

Let’s use a linear mixture model to assign each point to a particular linear model:

library(flexmix)
## Loading required package: lattice
set.seed(10)
model <- flexmix(interR0 ~ intraR0, tmp, k = 2)
summary(model)
## 
## Call:
## flexmix(formula = interR0 ~ intraR0, data = tmp, k = 2)
## 
##        prior size post>0 ratio
## Comp.1  0.78  574    643 0.893
## Comp.2  0.22  124    666 0.186
## 
## 'log Lik.' -498.5177 (df=7)
## AIC: 1011.035   BIC: 1042.873
plot(tmp$interR0, tmp$intraR0, col = clusters(model), 
     xlab = "intercity R0", ylab = "intracity R0")

Let’s map the clustering observed in the linear mixture model to the districts:

vn0 %>% 
  st_geometry() %>% 
  plot(col = "grey")

colors <- clusters(model)
colors[colors == 1] = 3

tmp %>% 
  transmute(cluster = clusters(model)) %>%
  st_geometry() %>%
  plot(add = TRUE, col = colors)

It is a little noisy, so let’s take a closer look at the districts that are “clearly” in one group or the other; that is, districts with posterior probabilities that indicate clear membership to one of the linear regression models.

postProb <- posterior(model) %>%
  as.data.frame() %>%
  transmute(model1 = V1, model2 = V2, cluster = clusters(model)) %>%
  mutate(m1Vm2 = model1/model2, m2Vm1 = model2/model1)
tmpPost <- tmp %>% 
  transmute(cluster = postProb$cluster, m1Vm2 = postProb$m1Vm2, 
            m2Vm1 = postProb$m2Vm1)
vn0 %>% 
  st_geometry() %>% 
  plot(col = "grey")

tmpPost %>% 
  filter(cluster == 1) %>%
  arrange(desc(m1Vm2)) %>%
  head(50) %>%
  st_geometry() %>%
  plot(add = TRUE, col = 3)

tmpPost %>% 
  filter(cluster == 2) %>%
  arrange(desc(m2Vm1)) %>%
  head(50) %>%
  st_geometry() %>%
  plot(add = TRUE, col = 2)

The districts with high intercity \(R_0\) relative to intracity \(R_0\) are clustered around Hanoi and Ho Chi Minh, while the districts with low intercity \(R_0\) relative to intracity \(R_0\) are clustered in remote regions.

Parameterization Method 2

The \(\beta_{ik}\) are defined as follows: \[ \begin{aligned} \beta_{ij} = \gamma R_{0 ij} \end{aligned} \] where \[ \begin{aligned} R_{0 ij} & = R_{0 \: Hanoi'} \frac{C_{ij} N_j}{C_{Hanoi'-Hanoi'} \cdot N_{Hanoi'}} \\ \end{aligned} \] and \(Hanoi'\) is the district in Hà Nội with the highest density. The numerator represents the average number of individuals from population \(j\) that colocate with an individual from population \(i\) per unit time, whereas the denominator is the average number of colocations among individuals in the densest district in Hanoi per unit time. This formulation scales the colocation probability by the population size of the interacting district.

The following function computes the \(R_{0 ij}\) values with the standard reference district with \(R_0\) of standardR0 and district ID of standardDistID.

compute_R0 <- function(dist_data, coloc_mat, standardR0, standardDistID) {
  popHanoiDist <- dist_data %>% 
    filter(polygon_id == HanoiDistID) %>% 
    pull(n)
  
  coloc_mat <- reshape2::melt(coloc_mat) %>% 
    setNames(c("polygon1_id", "polygon2_id", "link_value")) %>%
    dplyr::right_join(select(as.data.frame(dist_data), polygon_id, n), 
                      c("polygon2_id" = "polygon_id")) %>%
    dplyr::mutate(R0 = standardR0 * link_value * n / 
             (coloc_mat[paste(standardDistID), paste(standardDistID)] * popHanoiDist))
}

Let’s compute the \(R_{0 ij}\) with \(R_{0 \: Hanoi'} = 18\):

R0 <- compute_R0(dist_polygons, coloc_mat, 18, HanoiDistID)

Intracity \(R_0\)

Intracity \(R_0\) is defined as \(R_ii\).

Let’s see the distribution of the intracity \(R_0\) values; that is, the distribution of \(R_{0 ii}\) values.

intracityR0 <- R0 %>%
  filter(polygon1_id == polygon2_id) %>%
  rename(intraR0 = R0)
hist2(intracityR0$intraR0, n = 50, xlab = "Intracity R0", ylab = "number of districts", axes = FALSE)
axis(1, seq(0, 20, 2))
axis(2)

cb <- RColorBrewer::brewer.pal(9, "YlOrBr")
color_generator <- colorRampPalette(cb)
pal <- color_generator(10)
hist2(intracityR0$intraR0, quantile(intracityR0$intraR0, seq(0, 1, le = 11)), col = pal, axes = FALSE,
      xlab = "Intracity R0", ylab = "density of probability")
axis(1, seq(0, 20, 2))
axis(2)

quantile(intracityR0$intraR0, seq(0, 1, le = 11))
##        0%       10%       20%       30%       40%       50%       60%       70% 
##  0.000000  1.007264  1.242158  1.464806  1.650588  1.908886  2.221608  2.554752 
##       80%       90%      100% 
##  3.258339  4.705081 18.142725

The intracity \(R_0\) values are far more reasonable for measles.

Let’s map the intracity \(R_0\) values for the districts:

intracityR0sf <- dist_polygons %>%
  right_join(select(intracityR0, -polygon2_id, -n), c("polygon_id" = "polygon1_id"))

The following district have intracity \(R0\) greater than 5.

vn0 %>% 
  st_geometry() %>% 
  plot(col = "grey")

intracityR0sf %>%
  select("intraR0") %>% 
  filter(intraR0 > 5) %>%
  plot(add = TRUE, col = "red")

While there is slightly more clustering around the population centers of Hanoi and Ho Chi Minh, there is still high intracity \(R_0\) values in relatively remote regions.

Let’s explore the relationship between population density and intracity \(R_0\) values under the assumption that more remote regions will have lower population densities.

ggplot(intracityR0sf, aes(x = log10(den_km2), y = intraR0)) + geom_point()

This is a large improvement from the previous model, but there are still districts with low densities that have high intracity \(R_0\) values. One would assume that remote regions with low densities would have smaller within-district colocation probabilities because people are so sparsely located and rarely interact. This property, however, may be due to the fact that the majority of the population within these small districts reside only a small area of the much larger district; therefore, although the district has a low population density overall, the areas where residents actually reside may be quite dense. This could result in a high within-district colocation probability and consequently a high intracity \(R_0\). This hypothesis will be tested below.

Intercity \(R_0\)

Intercity \(R_0\) for location \(i\) is defined as \(\sum_{j \neq i} R_{0 ij}\).

Let’s consider the intercity R0 values:

intercityR0 <- R0 %>%
  filter(polygon1_id != polygon2_id) %>%
  rename(interR0 = R0)

totalInter <- intercityR0 %>%
  group_by(polygon1_id) %>%
  summarize(interR0 = sum(interR0)) %>%
  arrange(desc(interR0))
## `summarise()` ungrouping output (override with `.groups` argument)

Let’s map the intracity \(R_0\) values for the districts: The following districts have intercityR0 values greater than 1:

intercityR0sf <- dist_polygons %>%
  right_join(totalInter, c("polygon_id" = "polygon1_id"))
vn0 %>% 
  st_geometry() %>% 
  plot(col = "grey")

intercityR0sf %>%
  select("interR0") %>% 
  filter(interR0 > 1) %>%
  plot(add = TRUE, col = "red")

The districts with high intercity \(R_0\) values are clustered in high density/highly connected regions including Hanoi and Ho Chi Minh, which is what we would expect.

Let’s graph the relationship between population density and intercity \(R_0\) values:

ggplot(intercityR0sf, aes(x = log10(den_km2), y = interR0)) + geom_point()

As we would expect, population centers with higher population densities are better connected with surrounding regions resulting in higher intercity \(R_0\) values. Hence, the intercity \(R_0\) values under this model are consistent with epidemiological assumptions. Implicit in this analysis is the assumption that districts with higher population densities are better connected to surrounding regions. In general, this assumption holds because districts which have high population densities tend to have advanced transport networks that facilitate travel between other districts.

Intercity and Intracity \(R_0\) values

Let’s graph the relationship between intracity \(R_0\) values and intercity \(R_0\) values:

tmp <- intercityR0sf %>%
  right_join(select(intracityR0, -polygon2_id, -link_value), c("polygon_id" = "polygon1_id"))
ggplot(tmp, aes(x = intraR0, y = interR0)) + geom_point() +
  geom_abline(slope = 1, intercept = 0)

The grouping that we observed in the first model has largely disappeared. There are a few disticts with high intracity \(R_0\) values in comparison to intercity \(R_0\) which may be a point of concern. As we would expect, however, the intracity \(R_0\) values are weakly proportional to the intercity \(R_0\) values.

Risk of Infection

Let’s define the risk of infection for a location \(i\) as follows: \[ \begin{aligned} \text{Risk of infection for location i} = \frac{\sum_{j \: includes \: i } R_{0 ij}}{\text{max}_l(\sum_{j \: includes \: l} R_{0 lj})} \end{aligned} \] The sum of intracity \(R_0\) and intercity \(R_0\) reflects total risk of infection and was standardized to the highest sum of intracity \(R_0\) and intercity \(R_0\).

Let’s calculate it for all districts:

ROIdf <- R0 %>% 
  group_by(polygon1_id) %>%
  summarize(ROI = sum(R0)) %>%
  rename(polygon_id = polygon1_id) %>%
  arrange(desc(ROI)) %>%
  mutate(ROI = ROI / head(., 1)$ROI)
## `summarise()` ungrouping output (override with `.groups` argument)
ROIsf <- dist_polygons %>%
  right_join(ROIdf, "polygon_id")

It look like this:

ROIsf %>%
  arrange(desc(ROI)) %>%
  select(province, district, den_km2, ROI) %>%
  st_drop_geometry() %>%
  head(30)
##       province     district    den_km2       ROI
## 1  Hồ Chí Minh            3 40590.0774 1.0000000
## 2  Hồ Chí Minh           10 41979.8331 0.9462927
## 3       Hà Nội    Hoàn Kiếm 44315.7927 0.9253216
## 4  Hồ Chí Minh            5 41120.6293 0.8626589
## 5  Hồ Chí Minh           11 41427.8192 0.8570843
## 6  Hồ Chí Minh            4 41772.4692 0.8570194
## 7  Hồ Chí Minh    Phú Nhuận 37221.5664 0.8359342
## 8  Hồ Chí Minh            1 20906.4535 0.7298933
## 9  Hồ Chí Minh   Bình Thạnh 24932.7512 0.7215633
## 10 Hồ Chí Minh            6 36560.3898 0.7173413
## 11 Hồ Chí Minh      Tân Phú 32256.2361 0.7099070
## 12      Hà Nội Hai Bà Trưng 32389.5908 0.6873188
## 13      Hà Nội      Đống Đa 36694.1023 0.6848160
## 14 Hồ Chí Minh       Gò Vấp 33762.4512 0.6666487
## 15 Hồ Chí Minh     Tân Bình 17625.4264 0.6195264
## 16      Hà Nội   Thanh Xuân 30377.4820 0.5958250
## 17      Hà Nội      Ba Đình 26681.5236 0.5931674
## 18 Hồ Chí Minh            8 21146.4587 0.5394637
## 19    Hà Giang Hoàng Su Phì   104.9743 0.4910995
## 20     Đà Nẵng    Thanh Khê 21657.0815 0.4828190
## 21      Hà Nội     Cầu Giấy 19577.0573 0.4653514
## 22 Hồ Chí Minh            7  9740.5280 0.4205578
## 23    Hà Giang     Đồng Văn   174.0675 0.4133473
## 24   Hải Phòng      Lê Chân 18121.3750 0.4050484
## 25    Hà Giang      Mèo Vạc   147.2357 0.4035337
## 26      Hà Nội    Hoàng Mai  9864.9833 0.3924667
## 27     Đà Nẵng     Hải Châu  9491.6448 0.3900344
## 28 Hồ Chí Minh     Bình Tân 11968.4052 0.3760759
## 29    Hà Giang      Xín Mần   114.7977 0.3608692
## 30      Hà Nội       Tây Hồ  6257.0554 0.3475357

The districts with the highest risk of infection are located in Hanoi and Ho Chi Minh provinces. There are a few districts with very low population densities (Ha Giang Province in Northern Vietnam) that have relatively high risks of infection, which are presumably attributable to the fact that they are concentrated within a small region of the much large district and do not travel outside the population leading to very high \(C_{ii}\) values in the colocation matrix. This is tested below.

hist2(ROIsf$ROI, n = 50, xlab = "Intracity R0", ylab = "number of districts", axes = FALSE)
axis(1, seq(0, 1, 0.1))
axis(2)

Let’s map the risk of infection to the districts:

ROIsf %>%
  st_geometry() %>%
  plot(lwd = .1, col = pal[cut(ROIsf$ROI, quantile(ROIsf$ROI, seq(0, 1, le = 11)))], main = NA)

vn0 %>%
  st_geometry() %>%
  plot(add = TRUE)

Let’s examine the relationship between population density and risk of infection:

ggplot(ROIsf, aes(x = log10(den_km2), y = ROI)) + 
  geom_point() +
  geom_point(data = filter(ROIsf, province == "Hà Giang"), aes(x= log10(den_km2), y = ROI), color='red')

Generally, risk of infection increases with population density. We see, however, that some populations with low densities have high risks of infection. In particular, the districts in the Hà Giang province, which have particularly high ROI for the relatively low population densities are highlighted red.

Let’s break down the risk of infection into risk of infection due to intracity transmission and risk of infection due to intercity transmission for each district.

tmp <- totalInter %>%
  right_join(select(intracityR0, polygon1_id, intraR0, n), "polygon1_id") %>%
  rename(polygon_id = polygon1_id) %>%
  mutate(interROI = interR0 / (interR0 + intraR0)) %>%
  filter(interROI != "NaN")

interROI <- dist_polygons %>%
  right_join(select(tmp, polygon_id, interR0, intraR0, interROI), "polygon_id")

It look like this:

interROI %>%
  arrange(desc(interROI)) %>%
  select(polygon_id, province, district, den_km2, interROI) %>%
  st_drop_geometry() %>%
  head(30)
##    polygon_id          province     district     den_km2  interROI
## 1      834334 Bà Rịa - Vũng Tàu      Côn Đảo    81.96605 1.0000000
## 2      834482        Kiên Giang     Kiên Hải   739.64350 1.0000000
## 3      834281         Quảng Nam       Hội An  1561.87028 1.0000000
## 4      834741        Quảng Ninh        Cô Tô   122.76955 1.0000000
## 5      834792        Quảng Ninh      Vân Đồn    87.10470 1.0000000
## 6      834869           Long An      Tân Trụ   662.14180 0.6387840
## 7      834346        Tiền Giang    Tân Phước   193.30883 0.5634492
## 8      834400        Quảng Ngãi     Sơn Tịnh   433.67214 0.5612937
## 9      834248         Thanh Hóa     Đông Sơn   965.41999 0.5547872
## 10     834706       Hồ Chí Minh       Nhà Bè  1190.04327 0.5365434
## 11     834876        Tiền Giang      Chợ Gạo   855.24199 0.5346049
## 12     834487           Cần Thơ     Cái Răng  1411.69547 0.5070858
## 13     834887          Nam Định       Mỹ Lộc  1036.96940 0.5060699
## 14     834343           Long An   Châu Thành   740.58097 0.4913874
## 15     834710           Long An    Cần Giuộc   918.48350 0.4901447
## 16     834447         Vĩnh Long      Long Hồ   812.15741 0.4859359
## 17     834401        Quảng Ngãi     Tư Nghĩa   680.44044 0.4826807
## 18     834495         Hậu Giang   Châu Thành   611.27480 0.4796601
## 19     834883       Hồ Chí Minh   Bình Chánh  2011.09816 0.4789702
## 20     834349        Tiền Giang  Gò Công Tây   718.19836 0.4559759
## 21     834279           Đà Nẵng     Hòa Vang   186.92844 0.4558046
## 22     834820         Hải Phòng   Dương Kinh  1219.97605 0.4547509
## 23     834448         Vĩnh Long    Mang Thít   639.71911 0.4541046
## 24     834490           Cần Thơ   Phong Điền   835.00028 0.4478597
## 25     834296         Quảng Nam     Phú Ninh   347.33045 0.4450557
## 26     834277           Đà Nẵng Ngũ Hành Sơn  2029.21969 0.4428092
## 27     834863          Bạc Liêu     Vĩnh Lợi   429.29424 0.4411020
## 28     834691       Hồ Chí Minh            1 20906.45354 0.4408914
## 29     834379           Nghệ An  Hưng Nguyên   764.91638 0.4406954
## 30     834450         Vĩnh Long     Tam Bình   558.08908 0.4405246

Let’s map the percent ROI attributable to intercity transmission:

interROI %>%
  st_geometry() %>%
  plot(lwd = .1, col = pal[cut(interROI$interROI, quantile(interROI$interROI, seq(0, 1, le = 11)))], main = NA)

vn0 %>%
  st_geometry() %>%
  plot(add = TRUE)

Many of the districts with high ROI actually have relatively low ROI attributable to intercity transmission. This suggests that such the populations of such districts are isolated and a concentrated within a small region of the much larger district. Therefore, if a infected individual were to be introduced into a remote population with a high ROI (a rare event), the disease transmission within the district will be high because the individuals in the community are strongly connected. This will rarely happen, however, given how weak such a districts intercity interactions are with other populations.

Explorations of Hà Giang Province: high ROI and low density

The Hà Giang Province has a number of districts with high intracity \(R_0\) values in spite of their relatively low population densities; in fact, their population densities place them in the first quartile among all districts in Vietnam. One would assume that remote regions with low densities would have smaller within-district colocation probabilities because people are so sparsely located and rarely interact. The counter-intuitive property displayed by some districts in the Hà Giang Province may be attributable to the fact that the majority of the population within these districts reside only a small area of the much larger district; therefore, although the district has a low population density overall, the areas where residents actually reside may be quite dense. This could result in a high within-district colocation probability and consequently a high intracity \(R_0\).

There are 10 districts in the Hà Giang province, which is identified below:

plot(st_geometry(vn0), col = "grey")

dist_polygons %>%
  filter(province == "Hà Giang") %>%
  st_geometry() %>%
  plot(add = TRUE, col = "yellow")

AUC Explorations

In order to characterize how evenly the population is spread in the district, the actual population density of each Bing pixel is compared to the population density of each Bing pixel if the population was perfectly spread out over all the pixels in the district. This is captured by the shaded area in the following graph, where a smaller area corresponds to a more spread out population.

The following function computes the area-under-curve (AUC) statistic for the district with polygon_ID according to the worldpop data set.

compute_AUC <- function(polygon_ID, polygons = dist_polygons, rstr = worldpop) {
  dist <- polygons %>% 
    filter(polygon_id == polygon_ID)
  
  tmp <- worldpop %>%
    st_crop(dist) %>%
    st_as_stars() %>%
    unlist() %>%
    na.omit() %>%
    as.vector() %>%
    sort(decreasing = TRUE)
  
  cumsum <- (tmp / sum(tmp)) %>% cumsum() 
  (cumsum / length(cumsum)) %>% sum() - 0.5
}

Let’s compute the area-under-curve (AUC) statistic for each of the districts:

dist_AUCs <- unlist(purrr::map(pull(dist_polygons, polygon_id), compute_AUC)) %>%
  data.frame(polygon_id = pull(dist_polygons, polygon_id), AUC = .)

Let’s check whether there are any districts with no population according to WorldPop:

no_pop <- dist_AUCs %>%
  filter(AUC == -0.5) %>%
  pull(polygon_id)

dist_polygons %>%
  filter(polygon_id %in% no_pop) %>%
  select(polygon_id, province, district) %>%
  st_drop_geometry()
##   polygon_id          province     district
## 1     834334 Bà Rịa - Vũng Tàu      Côn Đảo
## 2     834302        Bình Thuận      Phú Quí
## 3     850002         Hải Phòng Bạch Long Vĩ
## 4     850004         Quảng Trị       Cồn Cỏ

Let’s map these districts without population data:

vn0 %>% 
  st_geometry() %>% 
  plot(col = "grey")

dist_polygons %>%
  filter(polygon_id %in% no_pop) %>% 
  st_centroid() %>%
  st_geometry() %>%
  plot(pch = 3, col = 'red', add = TRUE)

All the districts without population data are islands off the coast of Vietnam.

Let’s remove these data points from the AUC data set.

dist_AUCs %<>% 
  filter(AUC != -0.5)

Let’s examine the distribution of the AUC statistics:

hist2(dist_AUCs$AUC, n = 50, axes = FALSE, 
      xlab = "area-under-curve (AUC)", ylab = "number of districts")
axis(1, seq(0, 0.4, 0.05))
axis(2, seq(0, 50, 10))

tmp <- ROIsf %>%
  right_join(dist_AUCs, "polygon_id")

ggplot(tmp, aes(x = log10(den_km2), y = ROI)) + geom_point(aes(color = AUC))

Those districts with high ROI (above 0.25) and low density (below 300 / km2) do not necessarily have low AUC values. As such, the hypothesis that this property is attributable to the fact that the the majority of the population is concentrated in a small area relative to the district as whole may not hold. An alternative explanation may need to be sought.

Let’s map the AUC distribution:

dist_polygons %>%
  right_join(dist_AUCs, "polygon_id") %>%
  st_geometry() %>%
  plot(lwd = .1, col = pal[cut(dist_AUCs$AUC, quantile(dist_AUCs$AUC, seq(0, 1, le = 11)))], main = NA)

vn0 %>%
  st_geometry() %>%
  plot(add = TRUE)

The darker colored districts have higher AUC values; that is, districts whose populations are less evenly spread out will be colored a darker.

Facebook Population vs. Census Population Exploration

In the above model, the census population is used rather than the facebook population even though the colocation probabilities are based on calculations using the facebook population. As such, the artifacts in the data may be attributable to the use of the census population which may be inconsistent with the facebook population.

Let’s compare the facebook population data with the census population data. The following function combines the facebook data with the GADM / census data for each week:

dist_fb_populations <- function(x) {
  x %>%
    select(polygon1_id, fb_population_1) %>% 
    distinct() %>% 
    right_join(select(st_drop_geometry(dist_polygons), -area_km2, -den_km2), c("polygon1_id" = "polygon_id"))
}

Let’s compute it:

tmp <- map(colocation, dist_fb_populations)

Let’s consider the change in facebook population for each district over time. Note that each black line represents the facebook population of a district over time.

xs <- ymd(names(colocation)) - 6
plot(xs, seq_along(xs), ylim = c(0, 5), type = "n")

tmp %>% 
  map_dfc(pull, fb_population_1) %>% 
  t() %>% 
  as.data.frame() %>% 
  map(log10) %>% 
  walk(lines, x = xs, col = adjustcolor("black", .25))

The facebook population of each district does not seem to change as a function of time.

And the distribution accross district looks like this:

tmp %>% 
  map(mutate, prop = fb_population_1 / n) %>% 
  first() %>%
  pull(prop) %>%
  hist2(50, xlab = "facebook coverage", ylab = "number of districts")

Let’s look at the facebook coverage as a function of the population size for the first week of the colocation data:

fb_pop <- colocation$`2020-03-03` %>%
  select(polygon1_id, fb_population_1) %>% 
  distinct() %>% 
  right_join(select(st_drop_geometry(dist_polygons), -area_km2, -den_km2), c("polygon1_id" = "polygon_id"))
 
model <-  lm(log10(fb_population_1) ~ log10(n), fb_pop)
summary(model)
## 
## Call:
## lm(formula = log10(fb_population_1) ~ log10(n), data = fb_pop)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.03536 -0.19070 -0.01065  0.18854  1.33225 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4.55716    0.18196  -25.05   <2e-16 ***
## log10(n)     1.48787    0.03592   41.42   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2796 on 690 degrees of freedom
##   (6 observations deleted due to missingness)
## Multiple R-squared:  0.7132, Adjusted R-squared:  0.7128 
## F-statistic:  1716 on 1 and 690 DF,  p-value: < 2.2e-16
fb_pop %>%
  mutate(color = ifelse(province %in% c("Hà Giang"), "red", "blue")) %>%
  with(plot(log10(n), log10(fb_population_1), col = color, axes = FALSE,
               ylim = c(1, 4.5), xlim = c(3.8, 6),
               xlab = "district population", ylab = "facebook population"))
abline(model, col = "green")
axis(1, 4:6, format(10000 * c(1, 10, 100), big.mark = ",", scientific = FALSE, trim = TRUE))
axis(2, 1:4, format(10 * c(1, 10, 100, 1000), big.mark = ",", scientific = FALSE, trim = TRUE))

Meaning that the facebook coverage increases with population size. Note that the points corresponding to the Hà Giang province are highlighted red.

Let’s explore the relationship between the residuals of the above linear regression with the ROI/density plot derived from the above metapopulation model:

no_data_dist <- fb_pop[is.na(fb_pop$fb_population_1), ]$polygon1_id
`%notin%` <- Negate(`%in%`)
tmp <- fb_pop %>% 
  filter(polygon1_id %notin% no_data_dist) %>%
  mutate(residual = model$residuals)

The distribution of the residual values is given by:

tmp %>% 
  pull(residual) %>%
  hist2(50, xlab = "residual", ylab = "number of districts")

ROIsf %>%
  right_join(select(tmp, polygon1_id, residual), c("polygon_id" = "polygon1_id")) %>%
  ggplot(aes(x = log10(den_km2), y = ROI)) + geom_point(aes(color = residual)) +
  scale_color_continuous(low="yellow", high="blue")

The darker points represent districts with a higher, more positive residual, whereas the lighter points represent districts with a lower, more negative residual. A higher residual means that the facebook population is greater than what would be predicted from the census population, whereas a lower residual means that the facebook population is smaller than what would be predicted from the census population. We see that districts with low density but high ROI are generally characterized by more negative residuals. This is somewhat expected because a smaller facebook population relative to the census population means that a small fraction of individuals largely determine the colocation probability. If this small fraction of individuals are disproportionately connected relative to the district population as a whole, multiplying the colocation probability by the entire district population as given by the census data would result in a larger than expected ROI relative to population density.

Connectivity Web

Connecitivity web of colocation probabilities among districts.

Let’s create an sf object that contains linestrings between all possible pairs of districts in Vietnam:

all_links <- melt(coloc_mat) %>% 
  setNames(c("polygon1_id", "polygon2_id", "link_value")) %>%
  filter(polygon1_id != polygon2_id) %>%
  as.data.frame() %>%
  right_join(select(st_centroid(dist_polygons), polygon_id, geometry), c("polygon1_id" = "polygon_id")) %>%
  rename(geometry1 = geometry) %>%
  right_join(select(st_centroid(dist_polygons), polygon_id, geometry), c("polygon2_id" = "polygon_id")) %>%
  rename(geometry2 = geometry)
dist_linestrings <- st_sfc(mapply(function(a,b){st_cast(st_union(a,b),"LINESTRING")}, all_links$geometry1, all_links$geometry2, SIMPLIFY=FALSE)) %>%
  as.data.frame() %>%
  mutate(link_value = all_links$link_value) %>%
  st_as_sf()
dist_linestrings <- readRDS(coloc_line_path)

Let’s map these connections as a web:

vn0 %>% 
  st_geometry() %>% 
  plot(col = "grey")

dist_linestrings %>%
  plot(lwd = 0.05, col = pal[cut(all_links$link_value, quantile(all_links$link_value, seq(0, 1, le = 11)))], add = TRUE)

The darker the link color, the stronger the link.

As a reference, the distribution of the link values between districts is given below:

hist2(log2(all_links$link_value), n = 50, 
      xlab = "log2(link value)", ylab = "number of districts")

Let’s consider only the strongest links:

vn0 %>% 
  st_geometry() %>% 
  plot(col = "grey")

dist_linestrings %>%
  arrange(link_value) %>%
  tail(30000) %>%
  plot(lwd = 0.15, col = pal[cut(.$link_value, quantile(.$link_value, seq(0, 1, le = 11)))], add = TRUE)

The darker the link color, the stronger the link. Note the clustering of dark regions.

Parameterization Method 2: Facebook population

Let’s see whether the artifact disappears when we use facebook population data rather than census population data:

Let’s check whether there are any districts without facebook population data:

fb_pop[is.na(fb_pop$fb_population_1), ]
## # A tibble: 6 x 7
##   polygon1_id fb_population_1 province   district           n   lon   lat
##         <dbl>           <dbl> <chr>      <chr>          <dbl> <dbl> <dbl>
## 1      850001              NA Bình Phước Phú Riềng      97931  107.  11.7
## 2      834775              NA Điện Biên  Điện Biên Đông 59550  103.  21.2
## 3      850002              NA Hải Phòng  Bạch Long Vĩ     912  108.  20.1
## 4      850003              NA Kon Tum    Ia H' Drai      6638  108.  14.2
## 5      850004              NA Quảng Trị  Cồn Cỏ           400  107.  17.2
## 6      850005              NA Sơn La     Vân Hồ         61197  105.  20.8
no_data_dist <- fb_pop[is.na(fb_pop$fb_population_1), ]$polygon1_id

Let’s map these points:

vn0 %>% 
  st_geometry() %>% 
  plot(col = "grey")

dist_polygons %>%
  filter(polygon_id %in% no_data_dist) %>% 
  st_centroid() %>%
  st_geometry() %>%
  plot(pch = 3, col = 'red', add = TRUE)

Let’s estimate the facebook population of these districts based on the linear regression given above:

data <- data.frame(n = fb_pop[is.na(fb_pop$fb_population_1), ]$n)
fb_pop[is.na(fb_pop$fb_population_1), ]$fb_population_1 <- 10^predict(model, newdata = data)

Let’s verify whether these estimates make sense:

fb_pop %>%
  mutate(color = ifelse(polygon1_id %in% no_data_dist, "red", "blue")) %>%
  with(plot(log10(n), log10(fb_population_1), col = color, axes = FALSE,
               ylim = c(1, 4.5), xlim = c(3.8, 6),
               xlab = "district population", ylab = "facebook population"))
axis(1, 4:6, format(10000 * c(1, 10, 100), big.mark = ",", scientific = FALSE, trim = TRUE))
axis(2, 1:4, format(10 * c(1, 10, 100, 1000), big.mark = ",", scientific = FALSE, trim = TRUE))

The estimated points (colored red) are consistent with the actual population data.

The following function computes the \(R_0\) under the facebook population adjusted model:

compute_R0 <- function(dist_data, coloc_mat, standardR0, standardDistID) {
  popHanoiDist <- dist_data %>% 
    filter(polygon1_id == HanoiDistID) %>% 
    pull(fb_population_1)
  
  coloc_mat <- reshape2::melt(coloc_mat) %>% 
    setNames(c("polygon1_id", "polygon2_id", "link_value")) %>%
    dplyr::right_join(select(as.data.frame(dist_data), polygon1_id, fb_population_1), 
                      c("polygon2_id" = "polygon1_id")) %>%
    dplyr::mutate(R0 = standardR0 * link_value * fb_population_1 / 
             (coloc_mat[paste(standardDistID), paste(standardDistID)] * popHanoiDist))
}

Let’s compute the \(R_{0 ij}\) with \(R_{0 \: Hanoi'} = 18\):

R0 <- compute_R0(fb_pop, coloc_mat, 18, HanoiDistID)

Intracity \(R_0\)

Intracity \(R_0\) is defined as \(R_ii\).

Let’s see the distribution of the intracity \(R_0\) values; that is, the distribution of \(R_{0 ii}\) values.

intracityR0 <- R0 %>%
  filter(polygon1_id == polygon2_id) %>%
  rename(intraR0 = R0)
hist2(intracityR0$intraR0, n = 50, xlab = "Intracity R0", ylab = "number of districts", axes = FALSE)
axis(1, seq(0, 30, 2))
axis(2)

cb <- RColorBrewer::brewer.pal(9, "YlOrBr")
color_generator <- colorRampPalette(cb)
pal <- color_generator(10)
hist2(intracityR0$intraR0, quantile(intracityR0$intraR0, seq(0, 1, le = 11)), col = pal, axes = FALSE,
      xlab = "Intracity R0", ylab = "density of probability")
axis(1, seq(0, 30, 2))
axis(2)

quantile(intracityR0$intraR0, seq(0, 1, le = 11))
##         0%        10%        20%        30%        40%        50%        60% 
##  0.0000000  0.4798879  0.6141574  0.7499863  0.8732814  1.0192806  1.2171663 
##        70%        80%        90%       100% 
##  1.6096531  2.7136535  4.7678751 32.4549638

Let’s map the intracity \(R_0\) values for the districts:

intracityR0sf <- dist_polygons %>%
  right_join(select(intracityR0, -polygon2_id, -fb_population_1), c("polygon_id" = "polygon1_id"))

The following district have intracity \(R0\) greater than 10.

vn0 %>% 
  st_geometry() %>% 
  plot(col = "grey")

intracityR0sf %>%
  select("intraR0") %>% 
  filter(intraR0 > 10) %>%
  plot(add = TRUE, col = "red")

We see that there are two clusters: one in Hanoi Province and one in Ho Chi Minh Province. This is a significant improvement to the previous model, which considered census population data.

Let’s explore the relationship between population density and intracity \(R_0\) values under the assumption that more remote regions will have lower population densities.

ggplot(intracityR0sf, aes(x = log10(den_km2), y = intraR0)) + geom_point()

There are no longer districts with low densities that have high intracity R0 values as observed in previous formulations of the model.

Intercity \(R_0\)

Intercity \(R_0\) for location \(i\) is defined as \(\sum_{j \neq i} R_{0 ij}\).

Let’s consider the intercity R0 values:

intercityR0 <- R0 %>%
  filter(polygon1_id != polygon2_id) %>%
  rename(interR0 = R0)

totalInter <- intercityR0 %>%
  group_by(polygon1_id) %>%
  summarize(interR0 = sum(interR0)) %>%
  arrange(desc(interR0))
## `summarise()` ungrouping output (override with `.groups` argument)

Let’s see the distribution of the intracity \(R_0\) values; that is, the distribution of \(R_{0 ii}\) values.

hist2(totalInter$interR0, n = 50, xlab = "Intercity R0", ylab = "number of districts", axes = FALSE)
axis(1, seq(0, 30, 2))
axis(2)

Let’s map the intracity \(R_0\) values for the districts: The following districts have intercityR0 values greater than 1:

intercityR0sf <- dist_polygons %>%
  right_join(totalInter, c("polygon_id" = "polygon1_id"))
vn0 %>% 
  st_geometry() %>% 
  plot(col = "grey")

intercityR0sf %>%
  select("interR0") %>% 
  filter(interR0 > 1) %>%
  plot(add = TRUE, col = "yellow")

The districts with high intercity \(R_0\) values are clustered in high density/highly connected regions including Hanoi and Ho Chi Minh, which is what we would expect.

Let’s graph the relationship between population density and intercity \(R_0\) values:

ggplot(intercityR0sf, aes(x = log10(den_km2), y = interR0)) + geom_point()

As we would expect, population centers with higher population densities are better connected with surrounding regions resulting in higher intercity \(R_0\) values. Hence, the intercity \(R_0\) values under this model are consistent with epidemiological assumptions.

Intercity and Intracity \(R_0\) values

Let’s graph the relationship between intracity \(R_0\) values and intercity \(R_0\) values:

tmp <- intercityR0sf %>%
  right_join(select(intracityR0, -polygon2_id, -link_value), c("polygon_id" = "polygon1_id"))
ggplot(tmp, aes(x = intraR0, y = interR0)) + geom_point() +
  geom_abline(slope = 1, intercept = 0)

The grouping that we observed in the first and second model has largely disappeared. As we would expect, the intracity \(R_0\) values are weakly proportional to the intercity \(R_0\) values.

Risk of Infection

Let’s calculate it for all districts:

distROIdf <- R0 %>% 
  group_by(polygon1_id) %>%
  summarize(ROI = sum(R0)) %>%
  rename(polygon_id = polygon1_id) %>%
  arrange(desc(ROI)) %>%
  mutate(ROI = ROI / head(., 1)$ROI)
## `summarise()` ungrouping output (override with `.groups` argument)
distROIsf <- dist_polygons %>%
  right_join(distROIdf, "polygon_id")

It look like this:

distROIsf %>%
  arrange(desc(ROI)) %>%
  select(province, district, den_km2, ROI) %>%
  st_drop_geometry() %>%
  head(30)
##       province     district   den_km2       ROI
## 1  Hồ Chí Minh           10 41979.833 1.0000000
## 2  Hồ Chí Minh            3 40590.077 0.9656127
## 3  Hồ Chí Minh            5 41120.629 0.9248474
## 4  Hồ Chí Minh            4 41772.469 0.9221712
## 5  Hồ Chí Minh    Phú Nhuận 37221.566 0.8867105
## 6  Hồ Chí Minh           11 41427.819 0.8793910
## 7  Hồ Chí Minh     Tân Bình 17625.426 0.8404917
## 8  Hồ Chí Minh            1 20906.454 0.8236542
## 9  Hồ Chí Minh      Tân Phú 32256.236 0.7981342
## 10 Hồ Chí Minh   Bình Thạnh 24932.751 0.7702293
## 11 Hồ Chí Minh            6 36560.390 0.7503230
## 12      Hà Nội      Đống Đa 36694.102 0.7116461
## 13 Hồ Chí Minh       Gò Vấp 33762.451 0.7067554
## 14      Hà Nội Hai Bà Trưng 32389.591 0.6940107
## 15      Hà Nội   Thanh Xuân 30377.482 0.6708553
## 16 Hồ Chí Minh            8 21146.459 0.6351546
## 17 Hồ Chí Minh     Bình Tân 11968.405 0.6176974
## 18      Hà Nội    Hoàn Kiếm 44315.793 0.6111839
## 19      Hà Nội     Cầu Giấy 19577.057 0.6002443
## 20 Hồ Chí Minh            7  9740.528 0.5880481
## 21      Hà Nội      Ba Đình 26681.524 0.5679395
## 22      Hà Nội    Hoàng Mai  9864.983 0.5333316
## 23 Hồ Chí Minh           12  9273.140 0.4916247
## 24      Hà Nội      Từ Liêm  7104.665 0.4374471
## 25 Hồ Chí Minh      Thủ Đức 10881.460 0.4243820
## 26  Bình Dương        Dĩ An  6111.717 0.3764839
## 27      Hà Nội      Hà Đông  5997.269 0.3671120
## 28      Hà Nội       Tây Hồ  6257.055 0.3611581
## 29     Đà Nẵng    Thanh Khê 21657.081 0.3584673
## 30 Hồ Chí Minh            2  3414.600 0.3573184

The districts with the highest risk of infection are located in Hanoi and Ho Chi Minh provinces. There are no longer any districts with low population densities but relatively high risks of infection.

hist2(distROIsf$ROI, n = 50, xlab = "Intracity R0", ylab = "number of districts", axes = FALSE)
axis(1, seq(0, 1, 0.1))
axis(2)

Let’s map the risk of infection to the districts:

distROIsf %>%
  st_geometry() %>%
  plot(lwd = .1, col = pal[cut(distROIsf$ROI, quantile(distROIsf$ROI, seq(0, 1, le = 11)))], main = NA)

vn0 %>%
  st_geometry() %>%
  plot(add = TRUE)

Let’s examine the relationship between population density and risk of infection:

ggplot(distROIsf, aes(x = log10(den_km2), y = ROI)) + 
  geom_point() +
  geom_point(data = filter(distROIsf, province == "Hà Giang"), aes(x= log10(den_km2), y = ROI), color='red')

Generally, risk of infection increases with population density. Note that the districts in the Hà Giang province no longer have have particularly high ROI for their relatively low population densities as highlighted in red.

From Districts to Provinces

Let’s aggregate the colocation data from district-level to province-level: \[ \begin{aligned} R_{0 ij} = R_{0 \text{ Hồ Chí Minh}} \frac{\sqrt{\sum_{k \in P_i} \sum_{l \in P_j} C_{kl} N_l}}{\sqrt{\sum_{k \in P_{HCM}} \sum_{l \in P_{HCM}} C_{kl} N_l}} \end{aligned} \] where \(R_{0 \text{ Hồ Chí Minh}}\) is the reference \(R_0\) assigned a value of 18 and \(P_i\) is the set of districts in province \(i\). The numerator is the square root of the average number of individuals from province \(j\) that colocate with an individual from province \(i\) per unit time, whereas the denominator is the square root of the average number of colocations among individuals in Hồ Chí Minh.

Let’s compute the \(R_{0 ij}\) values with the standard reference province of Hồ Chí Minh assigned an \(R_0\) of 18.

prov_coloc <- reshape2::melt(coloc_mat) %>%
  setNames(c("polygon1_id", "polygon2_id", "link_value")) %>%
  dplyr::right_join(select(fb_pop, polygon1_id, province), "polygon1_id") %>%
  rename(province_1 = province) %>%
  dplyr::right_join(select(fb_pop, polygon1_id, province, fb_population_1),  c("polygon2_id" = "polygon1_id")) %>%
  rename(province_2 = province, fb_population_2 = fb_population_1) %>%
  mutate(coloc_num = link_value * fb_population_2) %>%
  group_by(province_1, province_2) %>%
  summarise(coloc_value = sum(coloc_num))
## `summarise()` regrouping output by 'province_1' (override with `.groups` argument)
ref_prov <- prov_coloc %>%
  arrange(desc(coloc_value)) %>%
  head(1) %>%
  pull(coloc_value)
prov_coloc %<>% mutate(R0 = 18 * sqrt(coloc_value) / sqrt(ref_prov))

The following code summarizes the province-level population and geographical data from GADM/census data set:

prov_polygons <- dist_polygons %>%
  group_by(province) %>%
  summarise(total_area_km2 = sum(area_km2), total_pop = sum(n), geom = st_union(geometry)) %>%
  mutate(total_den_km2 = total_pop / total_area_km2) %>%
  select(province, total_area_km2, total_pop, total_den_km2, geom)
## `summarise()` ungrouping output (override with `.groups` argument)

Intracity \(R_0\)

Intracity \(R_0\) is defined as \(R_{0 ii}\).

Let’s see the distribution of the intracity \(R_0\) values; that is, the distribution of \(R_{0 ii}\) values.

intracityR0 <- prov_coloc %>%
  filter(province_1 == province_2) %>%
  rename(intraR0 = R0) %>%
  select(province_1, intraR0) %>%
  rename(province = province_1)
hist2(intracityR0$intraR0, n = 50, xlab = "Intracity R0", ylab = "number of districts", axes = FALSE)
axis(1, seq(0, 30, 2))
axis(2)

The provinces with the largest intra-province \(R_0\) values are:

intracityR0 %>%
  arrange(desc(intraR0)) %>%
  head(10)
## # A tibble: 10 x 2
## # Groups:   province [10]
##    province    intraR0
##    <chr>         <dbl>
##  1 Hồ Chí Minh   18   
##  2 Hà Nội        12.9 
##  3 Đà Nẵng        5.73
##  4 Hải Phòng      5.42
##  5 Bình Dương     5.12
##  6 Đồng Nai       4.30
##  7 Thanh Hóa      4.25
##  8 Cần Thơ        3.79
##  9 An Giang       3.78
## 10 Bắc Ninh       3.71

Let’s map the intracity \(R_0\) values for the provinces:

intracityR0sf <- prov_polygons %>%
  right_join(intracityR0, "province") %>%
  select(province, intraR0, everything())
intracityR0sf %>%
  st_geometry() %>%
  plot(lwd = .1, col = pal[cut(intracityR0sf$intraR0, quantile(intracityR0sf$intraR0, seq(0, 1, le = 11)))], main = NA)

vn0 %>% 
  st_geometry() %>% 
  plot(add = TRUE)

Intercity \(R_0\)

Intercity \(R_0\) for location \(i\) is defined as \(\sum_{j \neq i} R_{0 ij}\).

Let’s consider the intercity R0 values:

intercityR0 <- prov_coloc %>%
  filter(province_1 != province_2) %>%
  dplyr::rename(interR0 = R0)
totalInter <- intercityR0 %>%
  group_by(province_1) %>%
  summarize(interR0 = sum(interR0)) %>%
  arrange(desc(interR0)) %>%
  rename(province = province_1)
## `summarise()` ungrouping output (override with `.groups` argument)

Let’s see the distribution of the intracity \(R_0\) values; that is, the distribution of \(R_{0 ii}\) values.

hist2(totalInter$interR0, n = 50, xlab = "Intercity R0", ylab = "number of districts", axes = FALSE)
axis(1, seq(0, 30, 2))
axis(2)

Let’s map the intracity \(R_0\) values for the provinces:

intercityR0sf <- prov_polygons %>%
  right_join(totalInter, "province") %>%
  select(province, interR0, everything())
intercityR0sf %>%
  st_geometry() %>%
  plot(lwd = .1, col = pal[cut(intercityR0sf$interR0, quantile(intercityR0sf$interR0, seq(0, 1, le = 11)))], main = NA)

vn0 %>% 
  st_geometry() %>% 
  plot(add = TRUE)

The districts with high intercity \(R_0\) values are clustered in high density/highly connected regions including Hanoi and Ho Chi Minh, which is what we would expect.

Let’s graph the relationship between population density and intercity \(R_0\) values:

ggplot(intercityR0sf, aes(x = log10(total_den_km2), y = interR0)) + geom_point()

As we would expect, population centers with higher population densities are better connected with surrounding regions resulting in higher intercity \(R_0\) values. Hence, the intercity \(R_0\) values under this model are consistent with epidemiological assumptions.

Intercity and Intracity \(R_0\) values

Let’s graph the relationship between intracity \(R_0\) values and intercity \(R_0\) values:

tmp <- intercityR0sf %>%
  right_join(intracityR0, "province")
ggplot(tmp, aes(x = intraR0, y = interR0)) + geom_point() +
  geom_abline(slope = 1, intercept = 0)

Note that a large fraction of the inter-province \(R_0\) for the provinces is attributable to connections with Hồ Chí Minh and Hà Nội:

Histogram of the percent inter-province ROI attributable to connections with Hồ Chí Minh and Hà Nội:

intercityR0 %>%
  right_join(totalInter, c("province_1" = "province")) %>%
  rename(totalInter = interR0.y, interR0 = interR0.x) %>%
  mutate(tmp = interR0 / totalInter * 100) %>%
  filter(province_2 %in% c("Hồ Chí Minh", "Hà Nội")) %>%
  group_by(province_1) %>%
  summarise(perTotal = sum(tmp)) %>%
  arrange(desc(perTotal)) %>%
  pull(perTotal) %>%
  hist2(., n = 50, xlab = "Percent interROI from Hồ Chí Minh/Hà Nộ", ylab = "number of districts", axes = FALSE)
## `summarise()` ungrouping output (override with `.groups` argument)
axis(1, seq(0, 100, 10))
axis(2)

Relationship between inter-province \(R_0\) and intra-province \(R_0\) with Hồ Chí Minh and Hà Nội connections removed:

intercityR0sf <- prov_coloc %>%
  filter(province_1 != province_2) %>%
  dplyr::rename(interR0 = R0) %>%
  filter(province_2 %notin% c("Hồ Chí Minh", "Hà Nội")) %>%
  group_by(province_1) %>%
  summarize(interR0 = sum(interR0)) %>%
  arrange(desc(interR0)) %>%
  rename(province = province_1)
## `summarise()` ungrouping output (override with `.groups` argument)
tmp <- intercityR0sf %>%
  right_join(intracityR0, "province")
ggplot(tmp, aes(x = intraR0, y = interR0)) + geom_point() +
  geom_abline(slope = 1, intercept = 0)

Risk of Infection

Let’s calculate it for all districts:

ROIdf <- prov_coloc %>% 
  group_by(province_1) %>%
  summarise(ROI = sum(R0)) %>%
  dplyr::rename(province = province_1) %>%
  arrange(desc(ROI)) %>%
  mutate(ROI = ROI / head(., 1)$ROI)
## `summarise()` ungrouping output (override with `.groups` argument)
ROIsf <- prov_polygons %>%
  right_join(ROIdf, "province")

It look like this:

ROIsf %>%
  arrange(desc(ROI)) %>%
  select(province, total_den_km2, ROI) %>%
  st_drop_geometry() %>%
  head(30)
## # A tibble: 30 x 3
##    province    total_den_km2   ROI
##    <chr>               <dbl> <dbl>
##  1 Hồ Chí Minh         3825. 1    
##  2 Hà Nội              2286. 0.751
##  3 Đà Nẵng             1113. 0.323
##  4 Bình Dương           648. 0.321
##  5 Hải Phòng           1453. 0.317
##  6 Thanh Hóa            337. 0.308
##  7 Long An              367. 0.306
##  8 Đồng Nai             529. 0.284
##  9 Cần Thơ              888. 0.272
## 10 Bắc Ninh            1533. 0.269
## # … with 20 more rows
hist2(ROIsf$ROI, n = 50, xlab = "Intracity R0", ylab = "number of districts", axes = FALSE)
axis(1, seq(0, 1, 0.1))
axis(2)

Let’s map the risk of infection to the provinces:

ROIsf %>%
  st_geometry() %>%
  plot(lwd = .1, col = pal[cut(ROIsf$ROI, quantile(ROIsf$ROI, seq(0, 1, le = 11)))], main = NA)

vn0 %>%
  st_geometry() %>%
  plot(add = TRUE)

Province-level ROI vs. aggregation of district-level ROI

We want to compare the province-level risk of infection with an aggregation of the district-level risk of infection:

First, aggregate the district-level ROI into provinces by taking mean ROI for each province:

tmp <- distROIdf %>%
  right_join(select(as.data.frame(dist_polygons), province, polygon_id), "polygon_id") %>%
  group_by(province) %>%
  summarise(totalROI = sum(ROI)) %>%
  arrange(desc(totalROI)) %>%
  mutate(adjROI = totalROI / head(., 1)$totalROI)
## `summarise()` ungrouping output (override with `.groups` argument)

Let’s collate the two sets of data:

bothROI <- ROIsf %>%
  right_join(select(tmp, province, adjROI), "province") %>%
  select(province, total_pop, ROI, adjROI, everything()) %>%
  arrange(desc(ROI))

Let’s see how they compare:

bothROI %>% 
  ggplot(aes(x = ROI, y = adjROI)) + 
  labs(x = "Province-level ROI", y = "District-adjusted Province ROI") + 
  geom_point()

They appear generally consistent. It is worth noting, however, that the province-level ROI tend to be higher than the district-adjusted province-level ROI. That is, the risk of infection of the district as a whole is generally considered greater than the sum of the risk of infection of the individual districts that make up the province.